Partial dependence profiles and accumulated local dependence

In [1]:
from profiles import train, X_train, Y_train, X_valid, Y_valid, column_names, treatment_col
from model import train_xgb_model, train_logistic, evaluate_uplift, simple_network, check_acc_diff, check_uplift_diff, local_search_xgb
from explanations import pdp_plot_uplift, ale_plot_uplift
train.head()
Dataset bias: 14.677734375000002% of positive answers 



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64000 entries, 0 to 63999
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   recency               64000 non-null  int64  
 1   history               64000 non-null  float64
 2   mens                  64000 non-null  int64  
 3   womens                64000 non-null  int64  
 4   zip_code_Surburban    64000 non-null  int64  
 5   zip_code_Rural        64000 non-null  int64  
 6   zip_code_Urban        64000 non-null  int64  
 7   newbie                64000 non-null  int64  
 8   channel_Phone         64000 non-null  int64  
 9   channel_Web           64000 non-null  int64  
 10  channel_Multichannel  64000 non-null  int64  
 11  segment               64000 non-null  int64  
dtypes: float64(1), int64(11)
memory usage: 5.9 MB
Out[1]:
recency history mens womens zip_code_Surburban zip_code_Rural zip_code_Urban newbie channel_Phone channel_Web channel_Multichannel segment
0 10 142.44 1 0 1 0 0 0 1 0 0 1
1 6 329.08 1 1 0 1 0 1 0 1 0 0
2 7 180.65 0 1 1 0 0 1 0 1 0 1
3 9 675.83 1 0 0 1 0 1 0 1 0 1
4 2 45.34 1 0 0 0 1 0 0 1 0 1

This is an uplift modelling dataset with the treatment indicator variable coded as segment here. It originally consisted of 3 types: no communication, men's communication and women's communication. For simplicity, the latter two are merged into one here (we already have gender information in other columns). After feature engineering, the columns represent:

  • recency - months since last purchase.
  • history - actual dollar value spent in the past year.
  • mens - 1 if the customer purchased men's merchandise in the past year.
  • womens - 1 if the customer purchased women's merchandise in the past year.
  • zip_code_Surburban - 0/1 indicator if suburban.
  • zip_code_Rural - 0/1 indicator if rural.
  • zip_code_Urban - 0/1 indicator if urban.
  • newbie - 1 if a new customer in the past twelve months.
  • channel_Phone - purchased by phone.
  • channel_Web - purchased by web.
  • channel_Multichannel - purchased through more than one of the above channels.

It is easy to see that these variables are highly correlated (e.g. history > 0 implies newbie = 0). Some other redundant variables were therefore removed.
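The merging of the two communication segments into a single binary treatment indicator can be sketched as follows (a minimal example with hypothetical raw labels in the style of the original data, not the actual preprocessing code from profiles.py):

```python
import pandas as pd

# Hypothetical raw segment labels, standing in for the original three types.
raw = pd.DataFrame({"segment": ["No E-Mail", "Mens E-Mail", "Womens E-Mail", "No E-Mail"]})

# Merge both campaign types into one treatment indicator:
# 1 = received any communication, 0 = control group.
raw["segment"] = (raw["segment"] != "No E-Mail").astype(int)

print(raw["segment"].tolist())  # [0, 1, 1, 0]
```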

Models and their scores

Robust XGBoost

In [2]:
r_xgb_model = local_search_xgb(X_train, Y_train, X_valid, Y_valid, treatment_col, just_get_model=True)
check_acc_diff(r_xgb_model, "Robust XGBoost", X_train, Y_train, X_valid, Y_valid)
check_uplift_diff(r_xgb_model, "Robust XGBoost", X_train, Y_train, X_valid, Y_valid, treatment_col)
Robust XGBoost train acc: 0.85345703125
Robust XGBoost valid acc: 0.8528125
Robust XGBoost train uplift score = 0.0594653287098299
Robust XGBoost valid uplift score = 0.03389569138409081

This model was found by a local search over hyper-parameters: approximate the local derivative of each hyper-parameter by sampling nearby values, cross-validate the model with them, then move in the ascending direction, looping over all parameters for several iterations until nothing changes, which means a local maximum has been reached. Models found by this method are highly robust, and as we see the scores on the train and valid datasets are close (the small train/valid gap is due to random fluctuations). However, the cross-validation metric was not accuracy or loss, but an adjusted gain: the gain of the current model compared to a random model, normalized by the maximum achievable score (visible on the plots above for the train and valid datasets respectively). This metric can also be interpreted as the relative monetary profit from using this algorithm instead of a random one.
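The search described above can be sketched as a simple coordinate ascent over hyper-parameters. This is a toy illustration with a made-up objective standing in for the cross-validated uplift score; local_search_xgb in model.py is the actual implementation:

```python
def local_search(score, params, steps, max_iters=50):
    """Coordinate ascent: for each parameter, probe nearby values
    (estimating the local 'derivative') and keep any improvement.
    Stops when a full pass changes nothing, i.e. a local maximum."""
    best = dict(params)
    for _ in range(max_iters):
        changed = False
        for name, step in steps.items():
            for candidate in (best[name] - step, best[name] + step):
                trial = dict(best, **{name: candidate})
                if score(trial) > score(best):
                    best = trial
                    changed = True
        if not changed:
            break  # no parameter improved: local maximum reached
    return best

# Toy stand-in for the cross-validated uplift score (peak at depth=6, eta=0.3).
toy_score = lambda p: -(p["max_depth"] - 6) ** 2 - (p["eta"] - 0.3) ** 2
found = local_search(toy_score, {"max_depth": 3, "eta": 0.1},
                     {"max_depth": 1, "eta": 0.05})
print(found)  # converges close to {'max_depth': 6, 'eta': 0.3}
```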

In [3]:
pdp_plot_uplift(r_xgb_model, X_train, Y_train, treatment_col)
Preparation of a new explainer is initiated

data is converted to pd.DataFrame, columns are set as string numbers
  -> data              : 51200 rows 12 cols
  -> target variable   : 51200 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : not specified, model's class short name is taken instead (default)
  -> predict function  : <function pdp_plot_uplift.<locals>.<lambda> at 0x7f1b84cf5550> will be used
  -> predicted values  : min = -0.35370606, mean = 0.058562264, max = 0.36117774
  -> residual function : difference between y and yhat (default)
Calculating ceteris paribus!:   0%|          | 0/12 [00:00<?, ?it/s]
  -> residuals         : min = -0.3611777424812317, mean = 0.08821507711727464, max = 1.3537060618400574
  -> model_info        : package xgboost

A new explainer has been created!
Calculating ceteris paribus!: 100%|██████████| 12/12 [00:00<00:00, 12.29it/s]
In [4]:
ale_plot_uplift(r_xgb_model, X_train, Y_train, treatment_col)
Preparation of a new explainer is initiated

data is converted to pd.DataFrame, columns are set as string numbers
  -> data              : 51200 rows 12 cols
  -> target variable   : 51200 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : not specified, model's class short name is taken instead (default)
  -> predict function  : <function ale_plot_uplift.<locals>.<lambda> at 0x7f1b84cf5790> will be used
  -> predicted values  : min = -0.35370606, mean = 0.058562264, max = 0.36117774
  -> residual function : difference between y and yhat (default)
Calculating ceteris paribus!:   0%|          | 0/12 [00:00<?, ?it/s]
  -> residuals         : min = -0.3611777424812317, mean = 0.08821507711727464, max = 1.3537060618400574
  -> model_info        : package xgboost

A new explainer has been created!
Calculating ceteris paribus!: 100%|██████████| 12/12 [00:01<00:00,  9.65it/s]

Calculating accumulated dependency!: 100%|██████████| 12/12 [00:02<00:00,  4.86it/s]

Unfortunately, we were not able to make the feature names appear on the plots, but the ids correspond to those in the dataset description.

The first thing to note is that the ALE plots are really close to the PDP profiles, so the model does not have (major) interactions. However, this does not mean that we could use some classifier without interactions, because here we model the uplift directly, which is a difference between predictions and is itself an interaction.
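For reference, a partial dependence profile can be computed by hand: fix feature j at a grid value for every row and average the model's predictions. A minimal numpy sketch with a toy additive model (not the dalex-based code used above):

```python
import numpy as np

def pdp(predict, X, j, grid):
    """Partial dependence: substitute each grid value into column j
    for the whole dataset and average the model's predictions."""
    profile = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v           # force feature j to the grid value
        profile.append(predict(Xv).mean())
    return np.array(profile)

# Toy additive model: no interactions, so the PDP recovers the 2*x0 effect
# up to a constant shift of mean(x1).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
predict = lambda X: 2.0 * X[:, 0] + X[:, 1]

profile = pdp(predict, X, j=0, grid=np.array([-1.0, 0.0, 1.0]))
print(profile)  # approximately [-2 + mean(x1), mean(x1), 2 + mean(x1)]
```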

Also worth noting: plot 11 represents the treatment, and since we substitute this variable during prediction, the explanation function cannot see any effect of tweaking it independently (its original value is simply ignored).
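The substitution mentioned above is the single-model approach to uplift: predict once with the treatment column forced to 1 and once forced to 0, and take the difference. A minimal numpy sketch with a dummy probability model and a hypothetical column layout:

```python
import numpy as np

def predict_uplift(predict_proba, X, treatment_idx):
    """Uplift = P(buy | treated) - P(buy | not treated),
    obtained by overriding the treatment column twice."""
    X_t, X_c = X.copy(), X.copy()
    X_t[:, treatment_idx] = 1   # everyone treated
    X_c[:, treatment_idx] = 0   # everyone in control
    return predict_proba(X_t) - predict_proba(X_c)

# Dummy model where treatment (last column) adds 0.1 to the probability.
predict_proba = lambda X: 0.2 + 0.1 * X[:, -1]

X = np.array([[5.0, 0.0], [3.0, 1.0]])
print(predict_uplift(predict_proba, X, treatment_idx=-1))  # [0.1 0.1]
```

Because the treatment column is overwritten in both branches, perturbing its original value cannot change the output, which is exactly why plot 11 is flat.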

Indicator variables:

  • 0 recency
  • 1 history
  • 2 mens
  • 3 womens
  • 4 zip_code_Surburban
  • 5 zip_code_Rural
  • 6 zip_code_Urban
  • 7 newbie
  • 8 channel_Phone
  • 9 channel_Web
  • 10 channel_Multichannel

Indicator parameters in favor of no Treatment:

  • Newbie (7), -0.025. Marketing campaigns have a better impact on previous customers, which is quite an intuitive conclusion.

Indicator parameters in favor of Treatment:

  • The biggest influence can be achieved on customers who bought 'womens' products (3), +0.04.
  • (2) Customers known to have bought 'mens' products can also be easier to convince (especially since this dataset focuses on campaigns directed at a specific gender). This does not conflict with the previous statement for two reasons: any bought product means that this person is not a "newbie" (7), and we thus have more information about the customer to approach them in a personalized manner (some products do not have any gender specified).
  • (9) +0.01. People who previously bought products using the web. This might be connected to the fact that these campaigns run through emails.

Indicator parameters without significant impact:

  • Two channel types: Phone (8) and Multichannel (10), describing how the customer previously bought products. It raises some concern that two of the three exclusive variables have no impact while the third, Web (9), has a positive one.
  • All one-hot encoded zip code categories: Suburban (4), Rural (5) and Urban (6), describing the dwelling place, seem to have no effect at all, which is plausible.

History

The total value of previously bought products, history (1), is the most distinctive variable and makes the model prone to overfitting; its high fluctuations reaffirm this. Intuitively, the function should be ascending; the stairs are probably due to the tree-like nature of the classifier.
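The "stairs" can be reproduced with any tree model: a tree's prediction is piecewise constant, so its dependence profile is a step function. A small sklearn illustration with toy data (not the actual model):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Smooth, increasing 1-D target, but a depth-limited tree can only emit
# one constant per leaf, producing a staircase profile.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(500, 1))
y = x.ravel()

tree = DecisionTreeRegressor(max_depth=3).fit(x, y)
grid = np.linspace(0, 10, 200).reshape(-1, 1)
preds = tree.predict(grid)

# At most 2**3 = 8 leaves, so at most 8 distinct prediction levels
# over the whole grid, however smooth the target was.
print(len(np.unique(preds)))  # at most 8
```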

Recency

Months since last purchase (0). The plot shows vague, flat sinusoidal fluctuations, with the global maximum being the first of the local maxima. Intuitively, the function should be concave, aiming at a "sweet spot" between the moment the customer "just walked out of the shop" and "forgot about us". The plot can be interpreted in this manner, with some noise at larger values, where the classifier is slightly overfitted to distinctive cases. This plot is also a valuable asset in itself: it can be used to choose the best time to run a campaign for a given customer.

Simple neural network

In [5]:
nn_model = simple_network(X_train, Y_train, X_valid, Y_valid)
check_acc_diff(nn_model, "Simple neural network", X_train, Y_train, X_valid, Y_valid)
check_uplift_diff(nn_model, "Simple neural network", X_train, Y_train, X_valid, Y_valid, treatment_col)

Simple neural network train acc: 0.8500781059265137
Simple neural network valid acc: 0.849609375
Simple neural network train uplift score = 0.01739879704223802
Simple neural network valid uplift score = 0.0008970143422168395

This is an overfitted small neural network (ReLU, 3 layers with 50 channels each, batch norm, trained with early stopping based on the validation loss), completely unable to learn robust features.
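The architecture described above might look roughly like this in Keras (a sketch reconstructed from the description; simple_network in model.py is the actual implementation):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_simple_network(n_features):
    """3 ReLU layers of width 50 with batch norm, sigmoid output."""
    inputs = tf.keras.Input(shape=(n_features,))
    x = inputs
    for _ in range(3):
        x = layers.Dense(50, activation="relu")(x)
        x = layers.BatchNormalization()(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

model = build_simple_network(12)
# Early stopping on validation loss, as described above:
stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)
# model.fit(X_train, Y_train, validation_data=(X_valid, Y_valid), callbacks=[stop])
```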

In [6]:
pdp_plot_uplift(nn_model, X_train, Y_train, treatment_col)
Preparation of a new explainer is initiated

data is converted to pd.DataFrame, columns are set as string numbers
  -> data              : 51200 rows 12 cols
  -> target variable   : 51200 values
  -> model_class       : tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier (default)
  -> label             : not specified, model's class short name is taken instead (default)
  -> predict function  : <function pdp_plot_uplift.<locals>.<lambda> at 0x7f1b7c3ff4c0> will be used
  -> predicted values  : min = 0.015753955, mean = 0.043276466, max = 0.5116573
  -> residual function : difference between y and yhat (default)
Calculating ceteris paribus!:   0%|          | 0/12 [00:00<?, ?it/s]
  -> residuals         : min = -0.5097959637641907, mean = 0.10350087733073451, max = 0.9811519868671894
  -> model_info        : package tensorflow

A new explainer has been created!
Calculating ceteris paribus!: 100%|██████████| 12/12 [00:02<00:00,  4.40it/s]
In [7]:
ale_plot_uplift(nn_model, X_train, Y_train, treatment_col)
Preparation of a new explainer is initiated

data is converted to pd.DataFrame, columns are set as string numbers
  -> data              : 51200 rows 12 cols
  -> target variable   : 51200 values
  -> model_class       : tensorflow.python.keras.wrappers.scikit_learn.KerasClassifier (default)
  -> label             : not specified, model's class short name is taken instead (default)
  -> predict function  : <function ale_plot_uplift.<locals>.<lambda> at 0x7f1b7c445430> will be used
  -> predicted values  : min = 0.015753955, mean = 0.043276466, max = 0.5116573
  -> residual function : difference between y and yhat (default)
Calculating ceteris paribus!:   0%|          | 0/12 [00:00<?, ?it/s]
  -> residuals         : min = -0.5097959637641907, mean = 0.10350087733073451, max = 0.9811519868671894
  -> model_info        : package tensorflow

A new explainer has been created!
Calculating ceteris paribus!: 100%|██████████| 12/12 [00:02<00:00,  4.15it/s]

Calculating accumulated dependency!: 100%|██████████| 12/12 [00:02<00:00,  4.61it/s]

The neural network model also does not seem to have any major interactions, though we have to keep in mind that it has a barely positive uplift score on the valid dataset. The first thing that comes to mind when looking at these plots is that the estimated values are on a different scale; however, for our uplift score this is not a concern: we only sort observations by the estimates, so only the relative order of predictions matters.
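The claim that only the relative order matters can be checked directly: any strictly increasing transform of the scores leaves the ranking, and hence any ranking-based uplift metric, unchanged. A small numpy illustration:

```python
import numpy as np

scores = np.array([0.05, -0.30, 0.36, 0.01])     # e.g. XGBoost's scale
rescaled = 1.0 / (1.0 + np.exp(-5.0 * scores))   # strictly increasing transform

# Same descending order of customers in both cases.
print(np.argsort(-scores))    # [2 0 3 1]
print(np.argsort(-rescaled))  # [2 0 3 1]
```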

Indicator variables

The patterns visible for the previous model are repeated, except for the dwelling place: this model suggests that campaigns are less persuasive for people from more rural areas.

Recency

Counter-intuitively, this function is convex instead of concave, which is a major difference between the models.

History

This model suggests that the more a client spends, the less important the campaign is for them. This could be because big spenders are "sure buyers" anyway.

Jan Ludziejewski